Analysis of Household Energy Usage for 3 Swiss Households

Data

This project examines the Electricity Consumption and Occupancy (ECO) dataset. The consists of energy measurements from 3 swiss households over a period of 8 months. For each household, the dataset consists of readings from a smart meter as well as the specific energy consumption of 7 different appliances. The data was aggregated so that each observation is the total consumption for one day from each household. All measurements were converted to kilowatt-hours (kWh).

The purpose of this analysis is to construct useful interactive visualizations to effectively answer the data science questions below. Specifically, the plotly and altair libraries are used to create interactive visualizations. A few “sketches” of the data are created with seaborn before constructing the final interactive charts.

Data Science Questions

  1. Which appliances consume the most energy on a daily basis? How does the relative consumption differ between households?
  2. Does total energy consumption change throughout the year? Does this differ between the three phases of energy consumption?

Data Preparation

The first step in our analysis is to prepare the data for analysis. In order to analyze the data, we need to get it into a useable format. The raw data is structured in separate files. For each household, there is a folder for the meter data and the plug data. The meter data contains the recordings from the meter for every day in a separate csv file. We combined the 600 separate csv files into a single dataframe using the following procedure:

  1. Read a single day’s data into a dataframe.
  2. For each csv file, calculate the daily energy usage in kilowatt hours (kWh)
    1. To convert to daily kWh: Daily energy consumption (kWh) = (Total watts used in a day / 1,000) / 3,600
  3. Combine all csv files into a single dataframe

From the smart meter data, we are left with the daily energy usage per household for every date. Other features such as current, voltage, and variance of the measurements were initially considered but discarded since they are not the focus of this analysis.

The second source of data were the “plugs” for each house. For every house, the energy usage of a specific appliance is recorded in a separate folder, with one csv per day. The appliances recorded were:

  1. Fridges
  2. Dryers
  3. Coffee Machines
  4. Kettles
  5. Washing machines
  6. PCs
  7. Freezers

As with the smart meter data, we converted the plug for each house into a single daily observation. For every date, plug, and household, we retrieved the energy usage in kWh using the calculation above and joined it to the smart meter dataset to match the respective date and household. The code used to manipulate the data is below.

Code
# For every subfolder in the ./data folder, if the suffix is _SM, read in every file as a dataframe
# For each file, add the columns names below:
# columns = ['powerallphases', 'powerl1', 'powerl2', 'powerl3', 'currentneutral', 'currentl1', 'currentl2', 
#           'currentl3', 'voltagel1', 'voltagel2', 'voltagel3', 'phaseanglevoltagel2l1', 'phaseanglevoltagel3l1', 
#           'phaseanglecurrentvoltagel1', 'phaseanglecurrentvoltagel2', 'phaseanglecurrentvoltagel3']
# Then, take the mean and SD of each column so there is only one row
# Append a data column that has the filename as a value
# Append a household column that has the household as the value, which occurs before the underscore (e.g. H6, H5, or H4)
# Join all rows together so that there is one dataframe with one day per each household

# * powerallphases: Sum of real power over all phases
# * powerl1: Real power phase 1
# * powerl2: Real power phase 2
# * powerl3: Real power phase 3
# * currentneutral: Neutral current
# * currentl1: Current phase 1
# * currentl2: Current phase 2
# * currentl3: Current phase 3
# * voltagel1: Voltage phase 1
# * voltagel2: Voltage phase 2
# * voltagel3: Voltage phase 3
# * phaseanglevoltagel2l1: Phase shift between voltage on phase 2 and 1
# * phaseanglevoltagel3l1: Phase shift between voltage on phase 3 and 1
# * phaseanglecurrentvoltagel1: Phase shift between current/voltage on phase 1
# * phaseanglecurrentvoltagel2: Phase shift between current/voltage on phase 2 
# * phaseanglecurrentvoltagel3: Phase shift between current/voltage on phase 3

import pandas as pd
import os

data_dir = './data'

# Initialize an empty list to store the dataframes
dfs = []


# Loop through each subfolder in the data directory
for folder in os.listdir(data_dir):
    if folder.endswith('_SM'):
        # Get the household ID from the folder name
        household = folder.split('_')[0]
        # Loop through each file in the subfolder
        for filename in os.listdir(os.path.join(data_dir, folder)):
            if filename.endswith(".csv"):
                # Read in the file as a dataframe
                df = pd.read_csv(os.path.join(data_dir, folder, filename), header=None, 
                                 names=['powerallphases', 'powerl1', 'powerl2', 'powerl3', 'currentneutral',
                                        'currentl1', 'currentl2', 'currentl3', 'voltagel1', 'voltagel2', 'voltagel3',
                                        'phaseanglevoltagel2l1', 'phaseanglevoltagel3l1', 'phaseanglecurrentvoltagel1',
                                        'phaseanglecurrentvoltagel2', 'phaseanglecurrentvoltagel3'])
                # Calculate the mean and standard deviation for each column
                df_mean = df.sum().to_frame().transpose()
                df_std = df.std().to_frame().transpose()
                # Add a column for the filename and household
                df_mean['date'] = filename.strip('.csv')
                df_std['date'] = filename.strip('.csv')
                df_mean['household'] = household
                df_std['household'] = household
                # Add the dataframes to the list
                this_df = pd.merge(df_mean, df_std, on=['date','household'], suffixes = ('_daily', '_std'))
                dfs.append(this_df)
                

# Concatenate all of the dataframes together
result = pd.concat(dfs, axis=0)

# Make date and household the first two columns
cols = ['date', 'household']
cols.extend([col for col in result.columns if col not in cols])
result = result.reindex(columns=cols)

# Covert the power to kilowatts - powerallphases_daily,powerl1_daily,powerl2_daily,powerl3_daily
result['powerallphases_daily'] = result['powerallphases_daily']/1000/3600
result['powerl1_daily'] = result['powerl1_daily']/1000/3600
result['powerl2_daily'] = result['powerl2_daily']/1000/3600
result['powerl3_daily'] = result['powerl3_daily']/1000/3600

# Drop all columns except date, household, powerallphases_daily, powerl1_daily, powerl2_daily, powerl3_daily
result = result[['date', 'household', 'powerallphases_daily', 'powerl1_daily', 'powerl2_daily', 'powerl3_daily']]

# Encode the date as a datetime object
result['date'] = pd.to_datetime(result['date'])

# Arrange by date and household
result = result.sort_values(by=['date', 'household'])

# Save the final dataframe to a CSV file
#result.to_csv('output_SM.csv', index=False)

# Sanity check - value counts for each household
result['household'].value_counts()

sm_data = result.copy()

The number of observations for each household in the dataset matches the number of days that they were recorded.

Code
# Plug data
# Each household (H4, H5, H6) has a folder with the suffix _plugs
# For each plug folder, there is a folder named 01 ... through 07. 
# Each of these folders has a separate csv for each day
# We need to read in each of these files and calculate the sum of the power in kWh by summing and dividing by 3600000
# Each plug should have a column with the date (from the filename) and household (from the folder)

#  Define a dictionary to map folder names to appliance types
appliance_map = {
    '01': 'fridge',
    '02': 'dryer',
    '03': 'coffee',
    '04': 'kettle',
    '05': 'washer',
    '06': 'pc',
    '07': 'freezer'
}

# Initialize an empty list to store the dataframes
dfs = []

# Loop through each household folder
for household_folder in os.listdir(data_dir):
    if household_folder.endswith('_plugs'):
        household = household_folder.split('_')[0]
        # Loop through each plug folder
        for plug_folder in os.listdir(os.path.join(data_dir, household_folder)):
            # Get the appliance type from the dictionary
            appliance = appliance_map.get(plug_folder, 'unknown')
            if plug_folder != ".DS_Store": # Ignore the .DS_Store file
                # Loop through each CSV file
                for filename in os.listdir(os.path.join(data_dir, household_folder, plug_folder)):
                    if filename.endswith('.csv'):
                        # Read in the CSV file as a dataframe
                        df = pd.read_csv(os.path.join(data_dir, household_folder, plug_folder, filename), names=['power'])
                        # Calculate the sum of power in kWh
                        df = df.sum().to_frame().transpose()
                        df['kWh'] = df['power']/3600000
                        # Add columns for the date, household, plug, and appliance type
                        df['date'] = pd.to_datetime(filename.split('.')[0])
                        df['household'] = household
                        df['plug'] = plug_folder
                        df['appliance'] = appliance
                        # Append the dataframe to the list
                        dfs.append(df)

# Concatenate all of the dataframes together
result = pd.concat(dfs)

# group by date and household, pivot by appliance
result = result.pivot(index=['date', 'household'], columns='appliance', values='kWh').reset_index()

# Save the final dataframe to a CSV file
#result.to_csv('output_plug.csv', index=False)

plugs = result.copy()
Code
# Keep only yyyy-mm-dd and remove hh:mm:ss
plugs['date'] = pd.to_datetime(plugs['date'].dt.date)
Code
# Join the sm data and plugs on the date and household, retaining all rows from both
df = pd.merge(sm_data, plugs, on=['date', 'household'], how='outer')
df = df.drop(columns=['unknown'])
Code
# Write the final dataframe to a CSV file
# Drop the unknown appliance column
df.to_csv("power_data_clean.csv", index=False)

Exploratory Data Analysis

Finally, we have a tidy dataset on the date/house level. The shape of the dataset is 632x14, with 32 days from the plugs that did not occur in the smart meter dataset. The remaining columns are:

  • date: The day of the measurement
  • household: The household that the measurement was taken from (H4, H5, or H6)
  • powerallphases_daily: (kWh) The total energy usage for the day
  • powerl1_daily: (kWh) The energy usage for phase 1
  • powerl2_daily: (kWh) The energy usage for phase 2
  • powerl3_daily: (kWh) The energy usage for phase 3
  • coffee: (kWh) The daily energy usage for the coffee machine
  • fridge: (kWh) The daily energy usage for the fridge
  • dryer: (kWh) The daily energy usage for the dryer
  • kettle: (kWh) The daily energy usage for the kettle
  • washer: (kWh) The dailyenergy usage for the washing machine
  • pc: (kWh) The daily energy usage for the PC
  • freezer: (kWh) The daily energy usage for the freezer

To begin the exploratory data analysis, we can examine summary statistics, look at some of the distributions of variables as well as how the variables are correlated in the dataset. Some static visualizations will guide our final interactive visualizations.

Code
df.describe()
powerallphases_daily powerl1_daily powerl2_daily powerl3_daily coffee dryer freezer fridge kettle pc washer
count 600.000000 600.000000 600.000000 600.000000 329.000000 597.000000 551.000000 578.000000 566.000000 560.000000 591.000000
mean 15.414557 7.070770 2.719145 5.624328 0.315968 0.170527 0.579788 0.258389 0.195285 0.290460 1.884256
std 8.999032 5.265384 1.944670 3.936660 0.285509 0.103838 0.373696 0.291260 0.194002 0.327534 1.784528
min 1.864760 0.432333 0.002444 0.774391 -0.008834 -0.019447 -0.018895 -0.020608 -0.022533 -0.019892 0.044193
25% 8.630140 1.924706 1.188225 2.286015 0.093377 0.124683 0.148998 0.000002 0.087050 0.031047 0.626971
50% 15.749312 6.967787 2.413147 4.950535 0.288394 0.142218 0.704999 0.111247 0.131475 0.222543 1.067496
75% 20.149719 10.595041 3.596908 8.370188 0.471979 0.199236 0.846771 0.586220 0.286475 0.325684 4.225914
max 61.602174 34.199160 17.380410 25.077918 1.334717 0.619953 1.323844 0.979461 1.078786 1.856928 6.239601

A few observations from the summary statistics:

  • There are the fewest observations from the coffee plug
  • The greatest energy usage occurs in phase 1 on average. Phase 1 also has the greatest standard deviation.
  • The plug with the greatest daily usage, on average, is the washer.

Next, we will create some rudimentary visualizations of the distributions below:

Code
# Use seaborn pairplot to examine distributions and relationships
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(df, hue='household')
<seaborn.axisgrid.PairGrid at 0x17068ea70>

Another rough visualization that could help direct our final product is a heatmap.

Code
# Create a heatmap to demonstrate the relationships between the variables
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
/var/folders/dm/hqy2vyf53pj_9vry3bc0lwrc0000gn/T/ipykernel_59707/654618798.py:3: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
<AxesSubplot: >

Development of interactive visualizations

Visualization 1: Which appliances consume the most energy on a daily basis? How does the relative consumption differ between households?

A visualization that demostrates the distributions of each appliance’s daily energy usage per each household could help us answer this question.

Code
# Save the tidy dataframe to another variable
tidy_df = pd.read_csv('power_data_clean.csv')

# Pivot the data longer so the appliance types are in a single column and the power is in a single column
df = tidy_df.melt(id_vars=['date', 'household'], var_name='appliance', value_name='kWh')

# Only keep the seven appliance types
df = df[df['appliance'].isin(['fridge', 'dryer', 'coffee', 'kettle', 'washer', 'pc', 'freezer'])]
Code
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Define a dictionary to map appliance names to colors
color_map = {
    'fridge': '#3A2E39',
    'washer': '#1E555C',
    'dryer': "#899795",
    'coffee': "#F4D8CD",
    'kettle': '#F1C5A8',
    'pc': '#EDB183',
    'freezer': '#F15152'
}

# Create the viz lists
# First list, every 7th value is true, the rest are false, 21 values total
viz_list1 = [True if i % 7 == 0 else False for i in range(21)]
viz_list2 = [True if i % 7 == 1 else False for i in range(21)]
viz_list3 = [True if i % 7 == 2 else False for i in range(21)] 
viz_list4 = [True if i % 7 == 3 else False for i in range(21)]
viz_list5 = [True if i % 7 == 4 else False for i in range(21)]
viz_list6 = [True if i % 7 == 5 else False for i in range(21)]
viz_list7 = [True if i % 7 == 6 else False for i in range(21)]



# Create a subplot for each household
fig = make_subplots(rows=1, cols=3, shared_yaxes='all', shared_xaxes='all',
                    subplot_titles=['House 4', 'House 5', 'House 6'],
                    x_title='Daily Energy Usage (kWh)', y_title='Number of Days')


# Function whether to show legend
def show_legend(i):
    if i%3 == 0:
        return True
    else:
        return False

# Loop through each household
for i, household in enumerate(['H4', 'H5', 'H6']):
    # Filter the dataframe by household
    household_df = df[df['household'] == household]

    # Create a histogram for each appliance
    for appliance in household_df['appliance'].unique():
        # Filter the dataframe by appliance
        appliance_df = household_df[household_df['appliance'] == appliance]

        # Add a histogram trace to the subplot
        fig.add_trace(
            go.Histogram(
                x=appliance_df['kWh'],
                name=appliance,
                marker_color=color_map.get(appliance, 'gray'),
                nbinsx=30,
                legendgroup=appliance,
                showlegend = show_legend(i)
            ),
            row=1, col=i+1
        )


# Update the subplot layout
fig.update_layout(
    title='Daily Energy Usage for All Appliances by Household',
    barmode='overlay',
    xaxis=dict(range=[-0.2, 6.2]),
    yaxis=dict(range=[0, 210])
)

# Add a dropdown menu to filter by appliance
fig.update_layout(
    updatemenus=[
        go.layout.Updatemenu(
            type='dropdown',
            showactive=True,
            buttons=[
                {'label': 'All', 'method': 'update', 'args': [{'visible': True}, {'title.text': 'Daily Energy Usage for All Appliances by Household'}]},
                {'label': 'Coffee', 'method': 'update', 'args': [{'visible': viz_list1}, {'title.text': 'Daily Energy Usage for Coffee Machine by Household'}]},
                {'label': 'Dryer', 'method': 'update', 'args': [{'visible': viz_list2}, {'title.text': 'Daily Energy Usage for Dryer by Household'}]},
                {'label': 'Freezer', 'method': 'update', 'args': [{'visible': viz_list3}, {'title.text': 'Daily Energy Usage for Freezer by Household'}]},
                {'label': 'Fridge', 'method': 'update', 'args': [{'visible': viz_list4}, {'title.text': 'Daily Energy Usage for Fridge by Household'}]},
                {'label': 'Kettle', 'method': 'update', 'args': [{'visible': viz_list5}, {'title.text': 'Daily Energy Usage for Kettle by Household'}]},
                {'label': 'PC', 'method': 'update', 'args': [{'visible': viz_list6}, {'title.text': 'Daily Energy Usage for PC by Household'}]},
                {'label': 'Washer', 'method': 'update', 'args': [{'visible': viz_list7}, {'title.text': 'Daily Energy Usage for Washer by Household'}]}
            ]
        )
    ]
)

fig.update_traces(xbins=dict( # bins used for histogram
        start=0.0,
        end=6.5,
        size=0.1
    ))

fig.update_traces(marker_line_width=1,marker_line_color="black",
                  opacity=0.8)

fig.update_layout(hovermode="x unified")
# Show the plot
fig.show()

I will highlight some of the specific choices made in the results section below. The above chart shows the number of days each compliance used that approximate amount of energy.

Visualization 2: Does total energy consumption change throughout the year? Does this differ between the three phases of energy consumption?

For the next visualization, I would like to examine how energy usage changes over the year by phase. We will start by aggregating across the three households by day and renaming some of the variables for clarity in the labels.

Code
# Pivot powerl1_daily   powerl2_daily   powerl3_daily to long format and create a "phase" column
df_phase = pd.melt(tidy_df, id_vars=['date', 'household'], var_name='phase', value_name='kWh')

# Only keep the 'powerl1_daily', 'powerl2_daily', and 'powerl3_daily' and powerallphases_daily rows
df_phase = df_phase[df_phase['phase'].isin(['powerl1_daily', 'powerl2_daily', 'powerl3_daily'])]

# Start by averaging each date
df_phase = df_phase.groupby(['date', 'phase']).mean().reset_index()

# Remove the one NA date
df_phase = df_phase.dropna()

# Show head
df_phase.head()

# Recode the phases to be Phase 1, Phase2, and Phase3
df_phase['phase'] = df_phase['phase'].replace({'powerl1_daily': 'Phase 1', 'powerl3_daily': 'Phase 3', 'powerl2_daily': 'Phase 2'})
/var/folders/dm/hqy2vyf53pj_9vry3bc0lwrc0000gn/T/ipykernel_59707/635431084.py:8: FutureWarning:

The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.

Next, we will construct the chart using altair’s linked views.

Code
import altair as alt


# Interval for date selection
interval = alt.selection_interval(encodings=['x'])

# Y axis scale
yscale = alt.Scale(domain=(0, 22))

# Define custom color scheme
color_scale = alt.Scale(domain=['Phase 1', 'Phase 2', 'Phase 3'], range=['#F1C5A8', '#F15152', '#1E555C'])


# Base chart - usage of phases over time
base = alt.Chart(df_phase).mark_line(size=2).encode(
    x=alt.X('date:T', title = "Date"),
    y=alt.Y('kWh:Q', scale = yscale),
    color = alt.Color('phase:N', title = "Phase", scale=color_scale)
)

# Top chart - zoomed in on selected interval
chart = base.encode(
    x=alt.X('date:T', scale=alt.Scale(domain=interval.ref()), title = "Date"),
    y=alt.Y('kWh:Q', title = "Daily Energy Usage (kWh)")
).properties(
    width=800,
    height=300,
    title = "Average Daily Energy Usage of Three Households July 2012 - Jan 2013"
)

# Bottom chart - brush to select interval
view = base.add_selection(
    interval
).properties(
    width=800,
    height=50,
    title = "Click and Drag Below to Select a Time Interval:"
)

# Add a histogram that reflects the mean of each phase during that interval
hist = (
    base.mark_bar()
    .encode(
        y=alt.Y("kWh:Q", aggregate='mean', stack=None, title="Mean Energy (kWh)", scale=yscale),
        x=alt.X('phase:N', stack=None, title="Phase", axis = alt.Axis(labelAngle=-45)),
        color=alt.Color("phase:N", scale=color_scale),
    )
    .transform_filter(interval)
    .properties(
        width=100,
        height=400,
        title = "Mean Energy Usage (kWh) in Selection"
    )
)

alt.hconcat(hist, alt.vconcat(chart, view), center = True)
/Users/bren/opt/anaconda3/envs/anly503/lib/python3.10/site-packages/altair/utils/core.py:317: FutureWarning:

iteritems is deprecated and will be removed in a future version. Use .items instead.

Results

Now I will explain the specific design choices that went into designing the two final visualizations.

Visualization 1: Daily Energy Usage By Appliance for Each Household

Code
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Define a dictionary to map appliance names to colors
color_map = {
    'fridge': '#3A2E39',
    'washer': '#1E555C',
    'dryer': "#899795",
    'coffee': "#F4D8CD",
    'kettle': '#F1C5A8',
    'pc': '#EDB183',
    'freezer': '#F15152'
}

# Create the viz lists
# First list, every 7th value is true, the rest are false, 21 values total
viz_list1 = [True if i % 7 == 0 else False for i in range(21)]
viz_list2 = [True if i % 7 == 1 else False for i in range(21)]
viz_list3 = [True if i % 7 == 2 else False for i in range(21)] 
viz_list4 = [True if i % 7 == 3 else False for i in range(21)]
viz_list5 = [True if i % 7 == 4 else False for i in range(21)]
viz_list6 = [True if i % 7 == 5 else False for i in range(21)]
viz_list7 = [True if i % 7 == 6 else False for i in range(21)]



# Create a subplot for each household
fig = make_subplots(rows=1, cols=3, shared_yaxes='all', shared_xaxes='all',
                    subplot_titles=['House 4', 'House 5', 'House 6'],
                    x_title='Daily Energy Usage (kWh)', y_title='Number of Days')


# Function whether to show legend
def show_legend(i):
    if i%3 == 0:
        return True
    else:
        return False

# Loop through each household
for i, household in enumerate(['H4', 'H5', 'H6']):
    # Filter the dataframe by household
    household_df = df[df['household'] == household]

    # Create a histogram for each appliance
    for appliance in household_df['appliance'].unique():
        # Filter the dataframe by appliance
        appliance_df = household_df[household_df['appliance'] == appliance]

        # Add a histogram trace to the subplot
        fig.add_trace(
            go.Histogram(
                x=appliance_df['kWh'],
                name=appliance,
                marker_color=color_map.get(appliance, 'gray'),
                nbinsx=30,
                legendgroup=appliance,
                showlegend = show_legend(i)
            ),
            row=1, col=i+1
        )


# Update the subplot layout
fig.update_layout(
    title='Daily Energy Usage for All Appliances by Household',
    barmode='overlay',
    xaxis=dict(range=[-0.2, 6.2]),
    yaxis=dict(range=[0, 210])
)

# Add a dropdown menu to filter by appliance
fig.update_layout(
    updatemenus=[
        go.layout.Updatemenu(
            type='dropdown',
            showactive=True,
            buttons=[
                {'label': 'All', 'method': 'update', 'args': [{'visible': True}, {'title.text': 'Daily Energy Usage for All Appliances by Household'}]},
                {'label': 'Coffee', 'method': 'update', 'args': [{'visible': viz_list1}, {'title.text': 'Daily Energy Usage for Coffee Machine by Household'}]},
                {'label': 'Dryer', 'method': 'update', 'args': [{'visible': viz_list2}, {'title.text': 'Daily Energy Usage for Dryer by Household'}]},
                {'label': 'Freezer', 'method': 'update', 'args': [{'visible': viz_list3}, {'title.text': 'Daily Energy Usage for Freezer by Household'}]},
                {'label': 'Fridge', 'method': 'update', 'args': [{'visible': viz_list4}, {'title.text': 'Daily Energy Usage for Fridge by Household'}]},
                {'label': 'Kettle', 'method': 'update', 'args': [{'visible': viz_list5}, {'title.text': 'Daily Energy Usage for Kettle by Household'}]},
                {'label': 'PC', 'method': 'update', 'args': [{'visible': viz_list6}, {'title.text': 'Daily Energy Usage for PC by Household'}]},
                {'label': 'Washer', 'method': 'update', 'args': [{'visible': viz_list7}, {'title.text': 'Daily Energy Usage for Washer by Household'}]}
            ]
        )
    ]
)

fig.update_traces(xbins=dict( # bins used for histogram
        start=0.0,
        end=6.5,
        size=0.1
    ))

fig.update_traces(marker_line_width=1,marker_line_color="black",
                  opacity=0.8)

fig.update_layout(hovermode="x unified")
# Show the plot
fig.show()

The goal of this visualization was to determine how much power each appliance was using. Further, I wanted to separate the graphs by household in case there were significant differences in usage between the households. In order to gain insight into how much energy each appliance was using, I encoded each appliance to a different color, and each household to a different subplot. I found that this would be more effective than the opposite: having house encoded by color and each appliance having its own facet. Since there are only 3 households - by extension 3 facets - this makes for a more compact visualization.

The interactive features are used to enhance the communication of information. The dropdown menu allows the user to filter the data by appliance, allowing for a clearer view of each histogram. This is especially useful for comparing the appliance usage between households. While having all appliances plotted at once can help one determine the general differences in household usage, viewing separate appliances can facilitate comparisons on that level. For example, if we select “washer”, we can determine that one household utilizes their washer very heavily (house 4), wheres the other two tend to have more normally distributed usages. We can also see that house 4 most likely does not use a pc, since the usage is 0 kWh for every day.

Finally, there is a tooltip feature that is useful for comparing between appliances. If we hover over any portion of the graph, we are able to get the histogram counts for each of the appliances.

Overall, this visualization is effective for comparing energy usage between appliances and households.

Visualization 2: Energy usage by phase over time

Code
import altair as alt
alt.data_transformers.enable('default',max_rows=None)


# Interval for date selection
interval = alt.selection_interval(encodings=['x'])

# Y axis scale
yscale = alt.Scale(domain=(0, 22))

# Define custom color scheme
color_scale = alt.Scale(domain=['Phase 1', 'Phase 2', 'Phase 3'], range=['#F1C5A8', '#F15152', '#1E555C'])


# Base chart - usage of phases over time
base = alt.Chart(df_phase).mark_line(size=2).encode(
    x=alt.X('date:T', title = "Date"),
    y=alt.Y('kWh:Q', scale = yscale),
    color = alt.Color('phase:N', title = "Phase", scale=color_scale)
)

# Top chart - zoomed in on selected interval
chart = base.encode(
    x=alt.X('date:T', scale=alt.Scale(domain=interval.ref()), title = "Date"),
    y=alt.Y('kWh:Q', title = "Daily Energy Usage (kWh)")
).properties(
    width=800,
    height=300,
    title = "Average Daily Energy Usage of Three Households by Phase July 2012 - Jan 2013"
)

# Bottom chart - brush to select interval
view = base.add_selection(
    interval
).properties(
    width=800,
    height=50,
    title = "Click and Drag Below to Select a Time Interval:"
)

# Add a histogram that reflects the mean of each phase during that interval
hist = (
    base.mark_bar()
    .encode(
        y=alt.Y("kWh:Q", aggregate='mean', stack=None, title="Mean Energy (kWh)", scale=yscale),
        x=alt.X('phase:N', stack=None, title="Phase", axis = alt.Axis(labelAngle=-45)),
        color=alt.Color("phase:N", scale=color_scale),
    )
    .transform_filter(interval)
    .properties(
        width=100,
        height=400,
        title = "Mean Energy Usage (kWh) in Selection"
    )
)

#hist | (chart & view) 
alt.hconcat(hist, alt.vconcat(chart, view), center = True)

The second visualization depicts energy usage over time for each phase. The total power consumption is measured as the sum of the three phases. The three phases could be helpful to monitor for energy professionals. My understanding is that in a three-phase system, balancing the load across the three phases is important. Here, we can see that the phases are not equally loaded, and that phase 1 tends to be the highest. There are some periods where phase 3 is the highest, and briefly the lowest. More detailed interpretation would need to be conducted by someone with more domain knowledge. The other (more general) function of this graph is to see how energy consumption changes over time. We can see that the month of december has the highest energy usage, which could be due to electric heating.

There are two primary interactive features: The selection at the bottom, and the reactive plot on the left. The selection at the bottom allows the user to select a time period of interest, and the units on the larger chart above update automatically. This permits for more granular visualization - although in this case, we have aggregated to the daily level, so that is the smallest unit. The reactive plot calculates the mean of each phase during the time period. This is helpful for the user to understand the approximate levels of energy used by each phase during that period.

Conclusion

The first visualization communicated the energy usage of individual appliances in each household. Interactivity and faceting were used to facilitate comparisons between households and appliances. The second visualization reflected the energy usage for each phase over time, and combined a subsetting view with a reactive histogram. This second visualization was helpful for understanding the trends in energy usage over custom time periods.

References

  1. Christian Beckel, Wilhelm Kleiminger, Romano Cicchetti, Thorsten Staake, and Silvia Santini The ECO Data Set and the Performance of Non-Intrusive Load Monitoring Algorithms. Proceedings of the 1st ACM International Conference on Embedded Systems for Energy-Efficient Buildings (BuildSys 2014). Memphis, TN, USA. ACM, November 2014.
  2. Wilhelm Kleiminger, Christian Beckel, Silvia Santini Household Occupancy Monitoring Using Electricity Meters. Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp 2015). Osaka, Japan, September 2015.